Self-supervised learning
What's self-supervised learning?
What's up with self-supervised learning? I've been seeing it everywhere lately.
- From Yann LeCun (Twitter, April 2019):
I now call it "self-supervised learning", because "unsupervised" is both a loaded and confusing term. In self-supervised learning, the system learns to predict part of its input from other parts of its input.
We may consider self-supervised learning as a subset of unsupervised learning that focuses on predicting the input from different 'views' or 'parts' of itself. In a way, it is supervised learning with no external labels.
For example:
Natural Language Processing: predict the next word in the sentence. Our labels come from the input itself, in this case the next word (pairs like these are sketched below). If our input is a sentence like "hello this is a sentence":
- 'hello ___', the label is 'this'
- 'hello this ___', the label is 'is'
- 'hello this is ___', the label is 'a'
- ...
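Here is a minimal sketch in plain Python (not from the original post) of how such (input, label) pairs can be generated for free from raw text:

```python
# Build next-word (input, label) pairs from a raw sentence.
# The "labels" are just words taken from the input itself.
def next_word_pairs(sentence: str):
    tokens = sentence.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = " ".join(tokens[:i])  # everything seen so far
        label = tokens[i]               # the next word is the label
        pairs.append((context, label))
    return pairs

print(next_word_pairs("hello this is a sentence"))
# [('hello', 'this'), ('hello this', 'is'),
#  ('hello this is', 'a'), ('hello this is a', 'sentence')]
```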
Computer Vision: image colorization. The label is the original (colour) image and the input is the same image converted to grayscale, as sketched below.
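A sketch of building such (input, label) pairs, assuming images are loaded with Pillow (the `colorization_pair` helper is hypothetical): the original colour image serves as the label and its grayscale copy as the input.

```python
from PIL import Image

def colorization_pair(path: str):
    """Return a (grayscale input, colour label) pair for one image file."""
    label = Image.open(path).convert("RGB")  # original colour image = label
    inp = label.convert("L")                 # grayscale copy = input
    return inp, label
```

During training, the loss then compares the network's colourized output with the colour label.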
But why?
Why use self-supervised learning?
Data scarcity
Labelled data is expensive and limited.
This becomes especially apparent when applying Deep Learning to new domains. Consider tackling a new medical imaging problem, such as identifying skin cancer. It is not a trivial task to obtain thousands of labels which require expert input.
Expressive
The input typically has much richer structure than a single sparse label. This makes for more generalizable, less brittle models.
- From DeepMind https://deepmind.com/blog/article/unsupervised-learning:
A key motivation for unsupervised learning is that, while the data passed to learning algorithms is extremely rich in internal structure (e.g., images, videos and text), the targets and rewards used for training are typically very sparse (e.g., the label ‘dog’ referring to that particularly protean species, or a single one or zero to denote success or failure in a game). This suggests that the bulk of what is learned by an algorithm must consist of understanding the data itself, rather than applying that understanding to particular tasks.
Easy
No one in their right mind will label every single frame of a video dataset. You could, however, train a network to colorize the video or increase its framerate.
Task agnostic
The most important attribute of neural networks is their modularity. As self-supervised models learn structure, they are easier to repurpose. For example, in NLP, a model trained to understand language structure can be repurposed for multiple tasks along the pipeline.
But why not both?
We have labelled data, we have unlabelled data. Why not use both? Enter semi-supervised learning. It appears that combining a smaller set of labelled data (an order of magnitude less) with self-supervised training works well. Really well, in fact: see 'Data-Efficient Image Recognition with Contrastive Predictive Coding' 1 for the CPC method.
Methods
Current state-of-the-art self-supervised methods may be split into two main groups: generative and contrastive.
Generative learning
Generative self-supervised methods, as the name suggests, focus on generative tasks. The learning process involves comparing the generated outputs with the original input.
Colorization
- The original input x_i is fed through a colourization network, c_i = f(g(x_i)), where g(x_i) is the grayscale image and c_i is the generated colourized image. The loss L(x_i, c_i) is applied to the colourized output to enable training.
Autoencoders
- The classic example of a neural network learning to rebuild its input. Similarly, we have L(x_i, x^{'}_{i}) where x^{'}_{i} = f(x_i); a minimal sketch follows after this list.
Language Modelling
- Any NLP generative task involving the prediction of any part of the input from any other part, for example next-word prediction, previous-word prediction or next-sentence prediction.
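As a concrete example of the autoencoder case above, here is a minimal sketch in PyTorch (the framework and the layer sizes are assumptions, not the post's own code): the network reconstructs its input, so the loss L(x, x') needs no external labels.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Tiny autoencoder: encode the input, then try to rebuild it."""
    def __init__(self, dim=784, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 784)                  # a batch of flattened inputs
x_hat = model(x)                          # x' = f(x)
loss = nn.functional.mse_loss(x_hat, x)   # L(x, x'), no external labels
loss.backward()
```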
Contrastive learning
Contrastive learning, as the name suggests, refers to unsupervised (self-supervised) learning objectives that involve contrasting similarities between inputs: high similarity between positive pairs and, inversely, low similarity for negative pairs.
Contrastive Predictive Coding 2.0 (2019)
Contrastive Predictive Coding (CPC) trains a neural network by predicting future observation representations from those of the past. It splits the input into patches, obtains a feature vector from each patch (think a single representation vector per patch) and then tries to predict future feature vectors from the current ones. The contrastive element is that the network must pick out the 'real' future feature vector from multiple 'fake' (unrelated) feature vectors.
Data-Efficient Image Recognition With Contrastive Predictive Coding 1 - i is set to be in the center of the image, predicting the feature vectors below the center.
The contrastive task is as follows:
- x_{i,j} - Divide the input into a set of overlapping patches.
- z_{i,j} = f_{\theta}(x_{i,j}) - Encode all patches x_{i,j} using a feature extractor f_{\theta} to obtain a feature vector per patch (thin gray vectors in the image above).
- c_{i,j} = g_{context}(z_{i,j}) - Mask each feature vector such that the receptive field of a given output neuron can only see inputs that lie above it in the image.
- \hat{z}_{i+k,j} = W_k c_{i,j} - Predicted future feature vector (k > 0), where W_k is the prediction matrix.
- \{z_l\} - Take negative samples from other patches in the image or from other images.
- L_{CPC} = -\sum_{i,j,k} \log p(z_{i+k,j} \mid \hat{z}_{i+k,j}, \{z_l\}) = -\sum_{i,j,k} \log \frac{\exp(\hat{z}_{i+k,j}^{T} z_{i+k,j})}{\exp(\hat{z}_{i+k,j}^{T} z_{i+k,j}) + \sum_{l} \exp(\hat{z}_{i+k,j}^{T} z_{l})} - the InfoNCE loss function. The network must correctly classify the target 'future' representation amongst a set of unrelated negative representations (a sketch of this loss follows below).
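A minimal PyTorch sketch of the InfoNCE objective above (the framework, the batch shapes and the `info_nce` name are assumptions): each predicted vector must score its true target higher than a set of negatives under a softmax over dot products.

```python
import torch
import torch.nn.functional as F

def info_nce(z_hat, z_pos, z_neg):
    """InfoNCE loss.
    z_hat: (B, D) predicted future representations, W_k c_{i,j}
    z_pos: (B, D) true future representations, z_{i+k,j}
    z_neg: (B, N, D) negative representations, {z_l}
    """
    pos = (z_hat * z_pos).sum(dim=-1, keepdim=True)   # (B, 1) dot products
    neg = torch.einsum("bd,bnd->bn", z_hat, z_neg)    # (B, N) dot products
    logits = torch.cat([pos, neg], dim=1)             # positive sits at index 0
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    # Cross-entropy with target 0 = -log softmax of the positive logit,
    # i.e. classify the real future among the negatives.
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16, 128))
```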
Intuition:
- The aim of this self-supervised method is to learn rich representations and enforce the encoding of shared information between different parts of the same image.
- As we predict further into the future, the amount of shared information becomes much lower. However, since the same object is being captured, the model learns to infer information about its global structure.
- Note that the loss function is applied to the representations themselves, unlike generative methods, which focus on the generated output at the pixel level. This results in richer, more abstract latent factors.
Results:
By removing the context layers (brown in the image below), freezing the feature extraction weights and adding a linear classifier on top, the network can be fine-tuned into a powerful supervised classification network.
The results of CPC are shown in the performance graph above.
Other um... learning
Self-supervision is limited only by what auxiliary tasks we can think of. It is important to consider, however, how transferable such auxiliary tasks are to other domains.
Rotation
- The network learns to predict which rotation was applied (see the sketch below).
Unsupervised Representation Learning By Predicting Image Rotations 2
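A sketch of how rotation labels can be generated on the fly (PyTorch assumed; the `rotation_batch` helper is hypothetical): every image yields four training examples whose label is the rotation that was applied.

```python
import torch

def rotation_batch(images):
    """images: (B, C, H, W) tensor.
    Returns rotated copies and labels 0..3 for 0, 90, 180 and 270 degrees."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))  # rotate spatial dims
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x, y = rotation_batch(torch.randn(8, 3, 32, 32))  # x: (32, 3, 32, 32), y: (32,)
```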
Solving a jigsaw puzzle
- The image is split into a grid and the tiles are shuffled. The network learns the correct ordering of the grid (see the sketch below).
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles 3
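A sketch of the jigsaw pretext task (PyTorch assumed; the grid size and the `jigsaw_example` helper are hypothetical): the image is cut into a 3x3 grid of tiles, the tiles are shuffled, and the permutation that was used becomes the label.

```python
import random
import torch

def jigsaw_example(image, grid=3):
    """image: (C, H, W) tensor with H and W divisible by `grid`.
    Returns the shuffled tiles and the permutation that was applied."""
    c, h, w = image.shape
    th, tw = h // grid, w // grid
    tiles = [image[:, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
             for i in range(grid) for j in range(grid)]
    perm = list(range(grid * grid))
    random.shuffle(perm)                               # the label: which permutation was used
    shuffled = torch.stack([tiles[p] for p in perm])   # (grid*grid, C, th, tw)
    return shuffled, torch.tensor(perm)

tiles, label = jigsaw_example(torch.randn(3, 96, 96))
```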
Footnote
Here is a list of self-supervised learning papers for more information.
References:
- 📄 - Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami: “Data-Efficient Image Recognition with Contrastive Predictive Coding”, 2019↩
- 📄 - Spyros Gidaris, Praveer Singh: “Unsupervised Representation Learning by Predicting Image Rotations”, 2018↩
- 📄 - Mehdi Noroozi: “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, 2016↩